SPECIFICATION

Improvements in or relating to acoustic recognition

The present invention relates to a method of acoustic
recognition and in particular to a method for improving the
performance of direct voice input/automatic voice recognition
systems such as voice recognition systems utilising time
encoded speech.

Voice recognition systems are known. GB Patent Application
No. 8323481 describes a method and system for recognising
voice signals using time encoded speech. The IEEE
Transactions on Communications Vol. COM-29, No. 5 of May 1981
provides a fuller description of developments in this art.

For many voice recognition applications "templates" or
"archetypes" of a complete set of words (or other acoustic
events) to be identified or recognised are built up during a
training phase and stored in a template or archetype store.
During the recognition phase, as words are spoken into the
recogniser the input utterance is compared with each member
from the set of templates/archetypes in the store. Some form
of "match" is computed and the closest match is identified by
the machine as the word spoken.
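The recognition phase described above can be sketched as a
nearest-template search. This is a minimal illustration only:
the feature vectors, the function names `recognise` and
`overlap_score`, and the similarity measure itself are all
assumptions made for the example and are not taken from the
specification.

```python
from typing import Callable, Dict, Sequence


def overlap_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical similarity: 1 minus the mean absolute difference."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def recognise(utterance: Sequence[float],
              templates: Dict[str, Sequence[float]],
              score: Callable[[Sequence[float], Sequence[float]], float]) -> str:
    """Compare the utterance with every stored template and return the
    word whose template gives the closest match (highest score)."""
    return max(templates, key=lambda word: score(utterance, templates[word]))


# Hypothetical templates built during a training phase.
templates = {
    "six": [0.9, 0.1, 0.4],
    "fix": [0.8, 0.2, 0.4],
    "zero": [0.1, 0.9, 0.7],
}
best = recognise([0.88, 0.12, 0.40], templates, overlap_score)
```

With these made-up vectors "six" and "fix" score close to one
another, which is the kind of near-tie that leads to the
confusion discussed next.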

Similar words, for example "no" and "go", "six" and "fix",
may cause confusion and misrecognition. It is accepted that,
in general, the larger the total set of words to be
recognised, the greater the likelihood of confusion between
some words in the set. It has also been observed that the
numerical set of words zero to nine is very difficult to
recognise with accuracy and consistency, since there is very
little acoustic material for an automatic recogniser to
operate upon. Some pairs of words, for example, "nine" and
"five" may be consistently misrecognised. Similarly, the
individual letters "b", "c", "d" and "e" may also be
consistently misrecognised from the alpha set of words. This
misrecognition/confusion is also characteristic of other sets
of utterances. This is particularly disadvantageous, since
most useful applications for direct voice input or voice
recognition need to employ the ubiquitous numeric set (or the
alpha-numeric set) of words. In telephone dialling, for
example, or in the selection of items from a menu presented
to the user, the numeric set of words is an easy and familiar
vehicle for most people to use. Other acoustic tokens are not
in general so readily acceptable without considerable
training.

The problem described above is further compounded by the fact
that different speakers pronounce the alpha-numeric set in
different ways. Also, changes in speaker diction caused by
stress, illness, microphone placement, background noise, etc.
produce considerable variation between recognition scores for
identical words spoken by the same speaker on different
occasions or in different circumstances. This may result in
confusion between different pairs of words in different
circumstances.

It is an object of the present invention to provide a method
of assigning individual members from a selected acoustic
sub-set such as the numeric (or alpha-numeric) sub-set of
symbols in such a way as to minimise confusion between these
symbols when spoken as inputs to an automatic voice
recognition system, and thus significantly to improve the
performance of the system in most realistic situations.

Accordingly, there is provided a method for improving the
performance of an acoustic recognition system, the method
comprising, for a set of acoustic events, comparing at least
one acoustic event with itself or an archetypal description
of itself and/or any other acoustic event of the set or an
archetypal description of any other acoustic event to
establish degrees of confusion between the compared acoustic
events, or archetypal descriptions, and disregarding one or
more acoustic events from the set in dependence upon the
established degrees of confusion thereby to provide a
preferred set of acoustic events for recognition by the
acoustic recognition system. Preferably, every acoustic event
of the set is compared with itself, or an archetypal
description thereof, and every other acoustic event of the
set, or archetypal descriptions thereof.

Advantageously, the method comprises, in dependence upon the
comparison of acoustic events of the set, or archetypal
descriptions thereof, formulating a comparison matrix of
comparison scores, each comparison score being indicative of
the degree of confusion between an acoustic event when
compared with itself or another acoustic event of the set,
and establishing from the comparison matrix a ranking order
for the compared acoustic events, which ranking order is
indicative of the preferred order for disregarding acoustic
events from the set to provide the preferred set of acoustic
events, and disregarding one or more acoustic events in
accordance with the established ranking order.

In a preferred embodiment the method comprises summing the
rows or columns of the comparison matrix to establish a
summed value of comparison scores for rows or columns of the
comparison matrix, identifying the comparison scores
corresponding to those compared acoustic events which give
rise to the greatest or least degrees of confusion,
subtracting these identified comparison scores from the
summed values of comparison scores for each row or column
after this subtraction to establish total comparison scores
for the comparison matrix and comparing the total comparison
scores to establish the identity of an acoustic event to be
disregarded from the set of acoustic events. The method may
comprise deleting from the comparison matrix those comparison
scores corresponding to the comparison of the identified
acoustic event with the other acoustic events of the set,
summing the rows or columns of the remaining comparison
scores to establish a summed value of comparison scores for
each remaining row or column, identifying the remaining
comparison scores indicating the greatest or least degree of
remaining confusion, subtracting those identified remaining
comparison scores from the summed values for the rows or
columns in which these identified remaining comparison scores
occur, summing the summed values of comparison scores for
each remaining row or column of the comparison matrix after
this subtraction to establish total comparison scores for the
remaining comparison matrix, and comparing the total
comparison scores for the remaining comparison matrix to
establish the identity of a further acoustic event to be
disregarded from the set of acoustic events.

The method steps may be repeated to establish the ranking
order for all the compared acoustic events of the set.

One or more compared acoustic events may be disregarded from
the set in dependence upon the established ranking order.

Preferably, any acoustic event disregarded from the set is
substituted by an acoustic event of equivalent meaning to the
disregarded acoustic event, but which results in less
confusion.

The acoustic events, before comparison, 
may be encoded using time encoded speech. 

The method of the present invention may be utilised in a
telephone dialling system.

The present invention will now be described, by way of
example, with reference to the accompanying drawings in
which:

Figures 1 to 4 illustrate a matrix of comparison scores
representing acoustic events from an acoustic set; and

Figures 5 to 7 illustrate how the method of the present
invention may be applied in practice to a telephone system.

In the following description, the numeric sub-set zero to
nine is used as an example, and a telephone dialling system
is used as the vehicle for illustration. However, it will be
realised that any other acceptable acoustic set in any
language may advantageously be treated in the same way, for
this or for other similar applications. It is also assumed in
the following examples that the system utilised has been
trained on the set of words zero to nine, and that the system
contains a template/archetype or other reference model for
each word zero to nine.

Referring to the drawings, each of the templates/archetypes
is compared and scored against itself, and against each of
the other templates/archetypes in the template/archetype
store. The results of these comparisons, that is to say the
comparison scores, are recorded in a Comparison Matrix (CM),
an example of which is shown in Fig. 1.
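A comparison matrix of this kind might be assembled as in the
following sketch. The specification does not say how a
comparison score is computed, so the cosine similarity used
here, the feature vectors and the function names are all
assumptions chosen only so that a template scored against
itself gives 1 (100%).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed scoring function: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def comparison_matrix(
    templates: Dict[str, Sequence[float]]
) -> Tuple[List[str], List[List[float]]]:
    """Score every template against itself and against every other template."""
    words = list(templates)
    matrix = [
        [cosine_similarity(templates[row], templates[col]) for col in words]
        for row in words
    ]
    return words, matrix


# Hypothetical templates; real archetypes would come from the training phase.
words, cm = comparison_matrix({
    "one": [1.0, 0.0, 1.0],
    "four": [1.0, 0.2, 0.9],
    "six": [0.0, 1.0, 0.0],
})
```

In this made-up data the "one" and "six" vectors share no
components, so their entry is zero, while "one" and "four"
score close to 1.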

In this example, it will be seen that templates/archetypes of
the words compared against themselves give a 1 (100%)
comparison score and that throughout the matrix, comparison
scores vary from very low values, such as zero between for
example the words one and six (indicating little similarity
or no similarity between the archetypes of these words), to
higher values such as 0.805 (80.5%) between the words one and
four (indicating 80.5% similarity between the archetypes of
these two words). The high degree of similarity between
archetypes of these words indicates that considerable
confusion may arise between the words one and four during any
recognition process with this system and under these
conditions.

An examination of the values in the Comparison Matrix shown
in Fig. 1 reveals a comparison score of 0.805 for the words
one and four, 0i^4'4 for the words six and seven, 0.539 for
the words five and nine, 0.522 for the words zero and three,
and so on. It will be appreciated that, alternatively, this
examination could also proceed from the lowest comparison
score upwards, rather than from the highest downwards.

From this information it may be deduced that the exclusion of
the words "one" and "four" from the numeric sub-set would
remove the major source of confusion from the sub-set and
make the remaining numbers easier for the machine to
recognise. According to the present invention, the comparison
score indicating the highest degree of confusion between any
two numbers in the sub-set is noted and the numbers
responsible for this score identified. Total comparison
scores for the numbers concerned are recorded (row totals),
less 100% entries, that is, the entries arising from the
comparison of numbers with themselves. Total comparison
scores for each row of the matrix are then summed, excluding
successively the scores of those events responsible for the
highest confusion. In the current example, as shown in
Diagram 2, the comparison scores of the whole matrix would be
summed row by row excluding the comparison scores generated
by comparing the word "one" with the other words in the
numeric sub-set (the "one" scores), and recorded. These sums
are shown by the column designated (-1) in Fig. 2. Likewise,
the comparison scores of the whole matrix, excluding the
"four" scores would be summed and recorded as shown by the
column designated (-4) in Fig. 2. The utterance giving the
lowest sum when excluded would then be labelled for removal
first from the utterance set to be used.

In the example shown, it can be seen that removing "one" from
the comparison matrix results in a total confusion of 9.574;
removing "four" results in a confusion of 9.725. Therefore,
to advantage, the word "one" should be the first utterance
considered for removal from the numeric subset.

The confusion matrix is then re-examined with all "one"
entries expunged (see Fig. 3) and the process is repeated
until a ranking order for all the utterances in the numeric
subset is produced, Fig. 4 showing the last utterance for
removal from the set. In the example shown by the matrix of
Fig. 1, the complete ordering would be 1, 7, 5, 0, 3, 8, 2,
4, 6, 9.
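The repeated procedure can be sketched as follows, as one
possible reading of the steps described with reference to
Figs. 1 to 4. The 4 x 4 matrix is invented for illustration
and is not the matrix of Fig. 1; the function names and the
handling of ties (the first member of the pair is taken) are
likewise assumptions.

```python
from typing import List, Sequence


def residual_confusion(cm: Sequence[Sequence[float]],
                       active: Sequence[int], excluded: int) -> float:
    """Sum the off-diagonal scores that remain once `excluded` is left out.
    Self-comparisons (the 100% entries) are never counted."""
    keep = [i for i in active if i != excluded]
    return sum(cm[r][c] for r in keep for c in keep if r != c)


def removal_order(labels: Sequence[str],
                  cm: Sequence[Sequence[float]]) -> List[str]:
    """Rank the utterances in the order in which they would be removed."""
    active = list(range(len(labels)))
    order: List[str] = []
    while len(active) > 1:
        # Note the pair responsible for the highest remaining confusion.
        pairs = [(r, c) for r in active for c in active if r != c]
        worst = max(pairs, key=lambda rc: cm[rc[0]][rc[1]])
        # Remove whichever member of the pair leaves the lower total.
        victim = min(worst, key=lambda e: residual_confusion(cm, active, e))
        order.append(labels[victim])
        active.remove(victim)
    order.append(labels[active[0]])
    return order


# Hypothetical symmetric comparison matrix, NOT the data of Fig. 1.
labels = ["one", "four", "five", "nine"]
cm = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.5],
    [0.2, 0.1, 0.5, 1.0],
]
order = removal_order(labels, cm)
```

For this invented matrix the most confusable pair is "one" and
"four"; excluding "four" leaves a total of about 1.6 against
about 1.8 for excluding "one", so "four" is ranked first for
removal, followed by "nine".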

The ranking order obtained by this method will of course vary
from speaker to speaker, system to system, day to day etc.,
and will be computed and available whenever it is required,
for example during the training or retraining phase of the
recogniser. It may be computed as a result of procedural
imperative, or it may be initiated by the user if for example
the noise in the room changes, the user has a cold, a tooth
removed, is forced to wear a face mask or respirator, etc.

Having established those utterances which are likely to
produce most confusion in decreasing order, then this
information is applied to the application under
consideration. In the example illustrated in Figs. 5 to 7, a
telephone dialling system is used but many other applications
commend themselves to this method.

In the telephone dialling system under consideration, three
modes of operation are postulated:

(a) pre-stored number facility;

(b) free-dialling facility;

(c) listed number facility.

In the "pre-stored number" facility the name or location of
the destination is pre-stored, and speaking the name or
location to the automatic recogniser will cause the number to
be dialled. This facility would be reserved for, say, the 10
most frequent numbers used by the subscriber.

In the "free dialling mode" the individual numbers would be
input by the user to form the telephone number for dialling.
This facility would be reserved for the most infrequent
numbers used by the subscriber, since it is time-consuming
and more prone to error.

In the "listed number" facility, a list of common telephone
numbers used frequently by the subscriber may be displayed on
a screen and labelled with a two or three-digit number prefix
to indicate the acoustic input required to identify the
number or person to be dialled, as shown in Fig. 5.

In most applications, forty to sixty numbers per page of the
list will be sufficient to provide an effective grade of
service. According to the present invention, the total number
of entries on the list is recorded and each entry is assigned
a "D" digit number to the smallest base "N", where N is less
than or equal to ten. (In the example given, D = 2 or 3.)

It can be seen that for a 64 entry list, under normal
circumstances, and with a full numeric set, all the digits
zero to nine will be used in pairs for each entry, and, in
the example given, maximum confusion will occur between
utterances 1, 4, 6, 7, etc., as illustrated in Fig. 1.

If it is desirable to restrict the digit prefix per list
entry to two, then the smallest possible base for the 64
entry list is "8" and two digits from a base 8 set of
acoustic events will provide a unique description for each
entry on the list.
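The choice of base follows from requiring N to the power D to
be at least the number of list entries. A short sketch of that
arithmetic is given below; the function name `smallest_base`
and the `max_base` argument are illustrative, the latter
reflecting the stated condition that N is at most ten.

```python
def smallest_base(entries: int, digits: int, max_base: int = 10) -> int:
    """Return the smallest base N such that N ** digits >= entries."""
    base = 2
    while base ** digits < entries:
        base += 1
    if base > max_base:
        raise ValueError("list too long for this number of digits")
    return base


two_digit_base = smallest_base(64, 2)    # 8, since 8 ** 2 == 64
three_digit_base = smallest_base(64, 3)  # 4, since 4 ** 3 == 64
```

Integer powers are used rather than a floating-point root so
that exact cases such as 64 entries are not affected by
rounding.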

From the ordering obtained by examination of the confusion
matrix, it can be seen that the exclusion of the numbers one
and six will result in a ("minimum confusion") base 8 set of
numbers, viz: 0, 2, 3, 4, 5, 7, 8, 9. This minimum confusion
set of numbers may then be assigned in order to the 64
entries in the list as shown in Fig. 6. Such an allocation
will minimise confusion and significantly improve the
performance of the voice recognition system.
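One way such an allocation could be carried out is sketched
below, using the base 8 set quoted above. The entry names are
placeholders, the function name is illustrative, and the
order in which prefixes are handed out is an assumption, since
the layout of Fig. 6 is not reproduced here.

```python
from itertools import product
from typing import Dict, Sequence


def assign_prefixes(entries: Sequence[str], digit_set: Sequence[str],
                    digits: int) -> Dict[str, str]:
    """Give each list entry a unique prefix built only from `digit_set`."""
    capacity = len(digit_set) ** digits
    if len(entries) > capacity:
        raise ValueError("not enough prefixes for this list")
    codes = ("".join(code) for code in product(digit_set, repeat=digits))
    return dict(zip(entries, codes))


# Base 8 set taken from the text; entry names are placeholders.
base8_set = ["0", "2", "3", "4", "5", "7", "8", "9"]
entries = [f"entry {n}" for n in range(1, 65)]
prefixes = assign_prefixes(entries, base8_set, digits=2)
```

Every one of the 64 entries receives a distinct two-digit
prefix, and no prefix contains an excluded digit.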

It will be appreciated that if a three digit prefix for each
entry on the list is acceptable, then base 4 descriptors may
be used. Under these circumstances, only four symbols from
the set zero to nine may be assigned to the individual
members of the list, resulting in the least possible
confusion via the base 4 set of numbers 2, 4, 6, 9, as shown
in Fig. 7.

In practice, some allowance for list expansion may be
included and, for a 3 digit prefix, a base 5 descriptor would
probably be specified for a list of up to 64 entries in case
the list is subsequently expanded to up to 125 entries.

By this method, whatever the features or characteristics of
the recognition system, the acoustic environment or the
variability of the individual user, the system is optimised
to produce maximum recognition performance of the available
acoustic set irrespective of the speaker or system or
environmental or other variations.

It will be appreciated that the comparison matrix described
is merely one of many which may be deployed, dependent upon
the characteristics of the recognition system architecture.
If, for instance, sets of individual tokens are stored to
form the template/archetype for each of the words in the
token set to be recognised, then the scores entered in the
comparison matrix may be the "maximum" from the token set or
the mean of the scores from the token set, and may include
the spread or variation of individual scores, and any or all
of these may be updated during use if necessary. All that is
required is a meaningful comparison score relative to the
recognition system architecture into which the present method
is incorporated, which will permit the successive exclusion
of those utterances causing maximum confusion in the total
set of utterances being used.
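Where several tokens are stored per word, the candidate
matrix entries mentioned above might be derived as in this
small sketch. The function name is illustrative, and taking
the population standard deviation as the "spread" is an
assumption; the text does not name a particular measure.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def summarise_token_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Reduce the scores from a token set to candidate matrix entries."""
    if not scores:
        raise ValueError("at least one token score is required")
    return {
        "max": max(scores),
        "mean": mean(scores),
        "spread": pstdev(scores),  # assumed measure of variation
    }


# Hypothetical scores of one word's tokens against another word's tokens.
summary = summarise_token_scores([0.70, 0.80, 0.90])
```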
It will also be appreciated that by these means a recognition
system can be given a capability of adapting to background
noise conditions. In the example of the listed number
facility if the background noise is polled by the system and
found to be greater than a specified level, the "D" parameter
which defines the number of digits in the prefix, can be made
larger, so that fewer individual utterances or acoustic
events may be used in the feature set.
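A sketch of this adaptation is given below. The noise scale,
the threshold value and the function names are hypothetical;
only the relationship between a longer prefix and a smaller
digit set comes from the text.

```python
from typing import Tuple


def smallest_base(entries: int, digits: int) -> int:
    """Smallest base N such that N ** digits >= entries."""
    base = 2
    while base ** digits < entries:
        base += 1
    return base


def prefix_plan(entries: int, noise_level: float, noise_threshold: float,
                quiet_digits: int = 2, noisy_digits: int = 3) -> Tuple[int, int]:
    """Choose the prefix length from the polled noise level and return
    (digits per prefix, number of distinct utterances needed)."""
    digits = noisy_digits if noise_level > noise_threshold else quiet_digits
    return digits, smallest_base(entries, digits)


# Hypothetical noise readings in arbitrary units.
quiet_plan = prefix_plan(64, noise_level=40.0, noise_threshold=60.0)
noisy_plan = prefix_plan(64, noise_level=75.0, noise_threshold=60.0)
```

For a 64 entry list the quiet case needs eight distinct
utterances with two digits per prefix, while the noisy case
needs only four with three digits per prefix.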

In other applications the vocabulary may be automatically
restricted or expanded to respond to the background noise
conditions.

The comparison matrix and comparison scores together with
resulting comparison steps, may be generated and stored in
any form of apparatus suitable for this purpose, such as an
appropriately programmed computer.

It should also be appreciated that other criteria may be
employed to establish the ranking order for removal of
utterances from the sub-set whilst remaining within the scope
of the invention. For example, at any stage in establishing
the ranking order for removal, the pair of utterances giving
rise to the highest degree of confusion may be recognised, as
described with respect to Figs. 1 to 4. However, even though
one of this pair may give rise to the lowest sum of
comparison scores when excluded from the sub-set it may be
realised that the other utterance of the pair also has a
relatively high degree of confusion with one or more other
utterances and hence may be chosen for removal even though a
slightly higher sum of comparison scores would remain.

CLAIMS 

1. A method for acoustic recognition, the method comprising,
for a set of acoustic events, comparing at least one acoustic
event with itself or an archetypal description thereof and/or
any other acoustic event of the set or an archetypal
description thereof thereby to establish degrees of confusion
between the compared acoustic events or archetypal
descriptions, and disregarding one or more acoustic events
from the set in dependence upon the established degrees of
confusion thereby to provide a preferred set of acoustic
events for recognition by an acoustic recognition system.

2. A method according to claim 1 wherein every acoustic event
of the set is compared with itself, or an archetypal
description thereof, and every other acoustic event of the
set, or archetypal descriptions thereof.

3. A method according to claim 1 or claim 2 comprising, in
dependence upon the comparison of acoustic events of the set,
or archetypal descriptions thereof, formulating a comparison
matrix of comparison scores, each comparison score being
indicative of the degree of confusion between an acoustic
event when compared with itself or another acoustic event of
the set, and establishing from the comparison matrix a
ranking order for the compared acoustic events, which ranking
order is indicative of the preferred order for disregarding
acoustic events from the set to provide the preferred set of
acoustic events, and disregarding one or more acoustic events
in accordance with the established ranking order.

4. A method according to claim 3 comprising summing rows or
columns of the comparison matrix to establish a summed value
of comparison scores for rows or columns of the comparison
matrix, identifying the comparison scores corresponding to
those compared acoustic events which give rise to the
greatest or least degrees of confusion, subtracting these
identified comparison scores from the summed values of
comparison scores for each row or column after this
subtraction to establish total comparison scores for the
comparison matrix and comparing the total comparison scores
thereby to establish the identity of an acoustic event to be
disregarded from the set of acoustic events.

5. A method according to claim 4 comprising deleting from the
comparison matrix those comparison scores corresponding to
the comparison of the identified acoustic event with the
other acoustic events of the set, summing the rows or columns
of the remaining comparison scores to establish a summed
value of comparison scores for each remaining row or column,
identifying the remaining comparison scores indicating the
greatest or least degree of remaining confusion, subtracting
these identified remaining comparison scores from the summed
values for the rows or columns in which these identified
remaining comparison scores occur, summing the summed values
of comparison scores for each remaining row or column of the
comparison matrix after this subtraction to establish total
comparison scores for the remaining comparison matrix thereby
to establish the identity of a further acoustic event to be
disregarded from the set of acoustic events.

6. A method according to claim 5 comprising repeating the
method to establish the ranking order for all of the compared
acoustic events of the set.

7. A method according to any of claims 3 to 6 wherein one or
more compared acoustic events are disregarded from the set in
dependence upon the established ranking order.

8. A method according to claim 7 wherein any acoustic event
disregarded from the set is substituted by an acoustic event
of equivalent meaning to the disregarded acoustic event.

9. A method according to any one of the preceding claims
wherein the acoustic events comprise words.

10. A method according to any of the preceding claims wherein
the acoustic events comprise phonemes.

11. A method according to any of the preceding claims wherein
the acoustic events, before comparison, are encoded using
time encoded speech (TES).

12. A method substantially as hereinbefore described with
reference to the accompanying figures.